Block-Based Approaches to Learning Ranking Functions with Application to Protein Homology Prediction
Abstract
In many information retrieval systems, such as Web search engines and biological-sequence search engines, the ranking function that orders search results by their relevance to the query is one of the most important components. In machine learning approaches to constructing ranking functions, the feature vectors of database items are computed relative to a query, so the data are naturally grouped into blocks, one per query. However, few existing algorithms take this block structure into account when learning a ranking function. This paper describes a series of approaches that learn ranking functions more accurately by exploiting the block structure of the data, and applies them to the protein homology prediction problem, a key step of protein structure prediction in bioinformatics. The approaches range from data normalization and data reduction to learner training. The data reduction methods, including block selection and support vector under-sampling, contributed to our original winning entry in the protein homology prediction task of the ACM KDDCUP-2004 competition. By extending the block-selection method to a query-adaptive version and using an ensemble-learning approach, we propose a novel ranking-function learning algorithm named K-Nearest-Block (KNB) Ensemble Ranking. In this algorithm, given the data block derived from a new query, only the most similar data blocks in the training data are used to learn a ranking function. Experiments with the support vector machine (SVM) as the benchmark learner demonstrate that all the proposed block-based approaches can significantly improve the ranking performance of SVMs. In particular, the KNB SVM ensemble is so far the most accurate method overall on the blinded test data set of the KDDCUP-2004 protein homology prediction problem.
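The K-nearest-block idea described in the abstract can be illustrated with a small sketch. The abstract does not say how block similarity is measured or how the ensemble members are combined, so everything concrete here is an assumption made for illustration: blocks are compared by the Euclidean distance between their per-feature means, one learner is trained per neighbouring block, and the learners' scores are averaged. A simple centroid-difference linear scorer stands in for the SVM, and all function names and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev

def normalize_block(rows):
    """Z-score each feature using statistics from this block only."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant features
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]

def signature(rows):
    """Assumed block descriptor: per-feature mean of the raw block."""
    return [mean(c) for c in zip(*rows)]

def k_nearest_blocks(query_rows, train_blocks, k):
    """Ids of the k training blocks whose signatures are closest to the query's."""
    q = signature(query_rows)
    return sorted(
        train_blocks,
        key=lambda b: math.dist(q, signature(train_blocks[b][0])),
    )[:k]

def fit_centroid_scorer(rows, labels):
    """Stand-in learner (not an SVM): w = mean(positives) - mean(negatives).

    Assumes the block holds at least one positive and one negative item.
    """
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y != 1]
    pos_centroid = [mean(c) for c in zip(*pos)]
    neg_centroid = [mean(c) for c in zip(*neg)]
    return [p - n for p, n in zip(pos_centroid, neg_centroid)]

def knb_rank(query_rows, train_blocks, k=2):
    """Rank query items by the mean score of learners from the k nearest blocks."""
    ids = k_nearest_blocks(query_rows, train_blocks, k)
    models = [
        fit_centroid_scorer(normalize_block(train_blocks[b][0]), train_blocks[b][1])
        for b in ids
    ]
    q_norm = normalize_block(query_rows)
    scores = [
        mean(sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in models)
        for x in q_norm
    ]
    order = sorted(range(len(query_rows)), key=lambda i: scores[i], reverse=True)
    return order, ids

# Toy data: block id -> (feature rows, 0/1 homology labels). Purely illustrative.
train_blocks = {
    "qA": ([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]], [1, 1, 0, 0]),
    "qB": ([[1.1, 0.1], [0.2, 0.8], [0.0, 1.0]], [1, 0, 0]),
    "qC": ([[50.0, 10.0], [10.0, 50.0], [12.0, 48.0]], [0, 1, 0]),  # distant block
}
query_rows = [[0.2, 0.9], [1.0, 0.1], [0.5, 0.5]]

order, neighbours = knb_rank(query_rows, train_blocks, k=2)
print("nearest blocks:", neighbours)
print("ranking (best first):", order)
```

In this sketch the distant block `qC` is excluded from training for this query, which is the point of the query-adaptive selection: only blocks resembling the new query's block influence its ranking function.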
Similar resources
Query-Adaptive Ranking with Support Vector Machines for Protein Homology Prediction
Protein homology prediction is a crucial step in template-based protein structure prediction. The functions that rank the proteins in a database according to their homologies to a query protein are the key to the success of protein structure prediction. In terms of information retrieval, such functions are called ranking functions, and are often constructed by machine learning approaches. Differe...
Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches
A DNA sequence, which contains all genetic traits, is not itself a functional entity. Instead, it is converted into protein sequences through transcription and translation. The protein sequence later takes on a 3D structure, which is the functional unit and manages biological interactions using the information encoded in DNA. Every life process one can think of is carried out by proteins with specific functio...
Application of learning to rank to protein remote homology detection
MOTIVATION: Protein remote homology detection is one of the fundamental problems in computational biology, aiming to find protein sequences in a database of known structures that are evolutionarily related to a given query protein. Some computational methods, such as PSI-BLAST, HHblits and ProtEmbed, treat this problem as a ranking problem and achieve state-of-the-art performance. This raises...
An Ensemble-Learning-Based Algorithm for Learning to Rank in Information Retrieval
Learning to rank refers to machine learning techniques for training a model in a ranking task. Learning to rank has been shown to be useful in many applications of information retrieval, natural language processing, and data mining. Learning to rank can be described by two systems: a learning system and a ranking system. The learning system takes training data as input and constructs a ranking ...
QUASAR - scoring and ranking of sequence-structure alignments
SUMMARY: Sequence-structure alignments are a common means for protein structure prediction in the fields of fold recognition and homology modeling, and there is a broad variety of programs that provide such alignments based on sequence similarity, secondary structure or contact potentials. Nevertheless, finding the best sequence-structure alignment in a pool of alignments remains a difficult pro...